library(knitr)
library(here)
This document collates all individual steps of VMS/Fish Ticket processing for a chosen year.
Indicate the PacFIN codes for the fish ticket species you are interested in tracking, as well as the gears. Landings and revenue for these species and gears will be tracked in the pipeline, for both targeted and non-targeted trips.
spp_codes <- c("DCRB")
# here are your options
#https://pacfin.psmfc.org/pacfin_pub/data_rpts_pub/code_lists/sp.txt
gear_codes <- c("CPT","CLP","OTH","OPT")
# https://pacfin.psmfc.org/pacfin_pub/data_rpts_pub/code_lists/agency_gears.txt
Indicate the metric to use in determining targets of fishing trips, as well as the threshold for determining the target of each fishing trip.
# exvessel_revenue for now
pacfin_weight_metric <- "LANDED_WEIGHT_LBS" # another option is "LANDED_WEIGHT_MTONS"
pacfin_revenue_metric <- "EXVESSEL_REVENUE" # another option is AFI_EXVESSEL_REVENUE
# target_metric <- "revenue" # afi_revenue is an alternative option for newer fish tickets, but use caution as you need to know reference year
## how much "more important" does your target need to be than the species with the second greatest catch? Expressed as a ratio.
target_cutoff <- 1
## do you want to report revenue and lbs for the given target species? (dynamic)
# include_target <- TRUE
The process_year parameter below will be passed through all processing steps.
alltime <- proc.time()
process_year <- 2017
We enforce a “lookback window” when matching VMS data to fish tickets: if there are more than X days (a configurable option) between successive tickets for the same vessel, we only include VMS data from those X days. Look for the lookback window option in the master process script.
Create lookback object. Here is where the lookback window duration can be changed. Default is 7 days.
lookback_window <- 7 # in days
year_begin <- lubridate::ymd(paste0(process_year,"-01-01"), tz= "America/Los_Angeles")
year_end <- lubridate::ymd(paste0(process_year,"-12-31"), tz= "America/Los_Angeles")
if(process_year == 2009){
lookback_begin <- year_begin
} else{
lookback_begin <- year_begin - lubridate::days(lookback_window)
}
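As a quick sanity check of the lookback arithmetic (same lubridate calls as above, applied to a hypothetical year), the 7-day window before 2017-01-01 begins on 2016-12-25:

```r
library(lubridate)

# Toy check: for a non-2009 year, the lookback period starts
# lookback_window days before January 1 of the process year.
demo_begin <- ymd("2017-01-01", tz = "America/Los_Angeles")
demo_lookback <- demo_begin - days(7)
format(demo_lookback, "%Y-%m-%d")  # "2016-12-25"
```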
Date written: 11/2018 Updated: 11/2019 By: M. Fisher, O.R. Liu
This step pre-processes fish tickets from PacFIN for eventual matching to VMS records.
A few key aspects of the processing code:
You can filter out fish tickets with less than a minimum exvessel revenue or landed pounds. This can help remove personal catch, or catch used for monitoring purposes (such as domoic acid testing for Dungeness crab). Note that tickets are removed based only on total exvessel revenue or total landed pounds, not species-specific minimums.
When calculating species-specific landed pounds and exvessel
revenue as separate columns, the code provides several options. It is
possible to leave in all species listed on the fish ticket, which would
create a different column for each species (around 300; not
recommended). You can also select for species which you are
particularly interested in. These species will each get their own
column, and all other species will be grouped into “other”. If you want
to break down the “other” category by gear type (i.e. other species
caught by pot or trap gear, as is default here), then you can specify
gear types in the gear types vector. Your choice of which
species to highlight for this section will not impact the
calculations of target species. In addition to reporting landed pounds
and revenue for pre-determined species, you can also have a dynamic
column that reports the same information for the target species of that
trip.
Target species is calculated using the maximum landed pounds and exvessel revenue across all species caught. A target species will only be identified if the target’s landed pounds / exvessel revenue exceed the next greatest value by the target_cutoff ratio (e.g., more than 10% greater with the default cutoff of 1.1). Otherwise, the target will be set to “NONE”. You will need to choose whether the target is calculated using landed pounds or exvessel revenue.
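To make the cutoff rule concrete, here is a toy two-species ticket (hypothetical proportions, not real data) run through the same ratio test:

```r
library(dplyr)

# Hypothetical ticket: DCRB is 80% of revenue, SABL is 20%.
# With a cutoff of 1.1, DCRB is the target because 0.8/0.2 = 4 >= 1.1.
toy <- tibble(
  spp  = c("DCRB", "SABL"),
  prop = c(0.8, 0.2)
)
cutoff <- 1.1
first_prop  <- max(toy$prop)
second_prop <- sort(toy$prop, decreasing = TRUE)[2]
target <- if (first_prop / second_prop >= cutoff) {
  toy$spp[which.max(toy$prop)]
} else {
  "NONE"
}
target  # "DCRB"
```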
All gear types listed for a specific trip are combined into the
columns gear_name_all and gear_code_all,
separated by a “/”. This prevents retention of duplicate fish tickets,
but allows for later filtering.
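The “/”-collapse works like this minimal sketch (hypothetical gear names): duplicate gears on a multi-row ticket reduce to one string, so the ticket keeps a single row.

```r
# Sketch of the collapse used to build gear_name_all.
gears <- c("CRAB POT", "CRAB POT", "FISH POT")
gear_name_all <- paste(unique(gears), collapse = "/")
gear_name_all  # "CRAB POT/FISH POT"
```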
If you want to reorder the columns in the data frame before writing it out, this can be done in the second to last code chunk, labeled (reorder_dataframe)
The code filters out any fish tickets with missing vessel
identifiers (VESSEL_NUM)
Required packages
library(tidyverse)
library(foreign)
library(qdapTools)
library(lubridate)
library(reshape2)
library(janitor)
library(here)
library(magrittr)
# ggplot theme
plot_theme <- theme_minimal()+
theme(text=element_text(family="sans",size=12,color="black"),
legend.text = element_text(size=14),
axis.title=element_text(family="sans",size=14,color="black"),
axis.text=element_text(family="sans",size=8,color="black"),
panel.grid.major = element_line(color="gray50",linetype=3))
theme_set(plot_theme)
keepers <- c('process_year','alltime','spp_codes','gear_codes','pacfin_weight_metric','pacfin_revenue_metric','target_cutoff','lookback_window','year_begin','year_end','lookback_begin')
rm(list=setdiff(ls(),keepers))
Choose directories and set objects
## which years of fish tickets are you interested in?
# process_year = 2021
# process_year = 2018 #Create a version of 2018 that has FTID
## which species-specific revenue and lbs columns would you like? (static)
# species_cols <- spp_codes
## calculate an "other" category according to a certain gear type? If so, which gear?
# gear_types <- gear_codes
## would you like target based on exvessel revenue or landed lbs?
# target_metric <- "revenue"
## how much "more important" does your target need to be than the species with the second greatest catch? Expressed as a ratio. (default: target species catch / landings must be 10% greater than second greatest species catch / landings = 1.1)
# target_cutoff <- 1.1
## do you want to report revenue and lbs for the given target species? (dynamic)
# include_target <- TRUE
This should be a .csv file containing raw PacFIN data. We only read in columns that we need.
## for running 2020 and 2021 pipeline, all fish tickets are already together:
rawdat <- read_rds(here::here('Confidential','raw','fish tickets','all_fishtickets_1994_2023.rds'))
rawdat <- rawdat %>%
filter(LANDING_YEAR %in% process_year)
#earlier years:
#tixfiles <- list.files(here::here('data','raw','fish tickets'),full.names = T)
# tixfiles <- list.files(here::here('data','raw','fish tickets','has FTID'),full.names = T)
#
# rawdat <- purrr::map_df(tixfiles,function(fl){
# read_csv(fl,col_types= cols_only(
# FISH_TICKET_ID = col_double(),
# FTID = col_character(), #Also include FTID column to compare between logbooks - won't work if column doesn't appear in ALL raw fish ticket files
# PACFIN_PORT_CODE= col_character(),
# PACFIN_GROUP_PORT_CODE= col_character(),
# VESSEL_NUM= col_character(),
# AGENCY_CODE= col_character(),
# GEAR_CODE= col_double(),
# GEAR_NAME= col_character(),
# PACFIN_GROUP_GEAR_CODE= col_character(),
# REMOVAL_TYPE_CODE= col_character(),
# REMOVAL_TYPE_NAME= col_character(),
# LANDING_DATE= col_character(),
# LANDING_MONTH= col_double(),
# LANDING_YEAR= col_double(),
# PACFIN_SPECIES_CODE= col_character(),
# LANDED_WEIGHT_LBS= col_double(),
# EXVESSEL_REVENUE= col_double()))
# })
#
# rawdat <- rawdat %>%
# filter(LANDING_YEAR %in% process_year)
Check to make sure columns were in the correct format for reading. This will return the names of columns with parsing errors.
problems(rawdat) %>% select(col) %>% distinct()
## # A tibble: 0 × 1
## # ℹ 1 variable: col <int>
# The columns with parsing errors are usually from columns that we do not need anyway
First, subset the raw data to include only the columns that are needed. Rename the columns that will be retained in the final processed data. The last three columns are used to calculate per-species / total revenue and landed weight, but will not ultimately be retained.
rawdat.sub <- rawdat %>%
dplyr::select(FISH_TICKET_ID, FTID, PACFIN_PORT_CODE, PACFIN_GROUP_PORT_CODE, VESSEL_NUM, AGENCY_CODE, GEAR_CODE, GEAR_NAME, PACFIN_GROUP_GEAR_CODE, REMOVAL_TYPE_CODE, REMOVAL_TYPE_NAME, LANDING_DATE, LANDING_MONTH, LANDING_YEAR, PACFIN_SPECIES_CODE, LANDED_WEIGHT_LBS, EXVESSEL_REVENUE) %>%
# change some column names
set_colnames(c("Rec_ID", "FTID", "pacfin_port_code", "port_group_code","drvid", "agency_code","gear_code", "gear_name", "gear_group", "removal_type_code", "removal_type_name", "date", "month", "year",
"PACFIN_SPECIES_CODE", "LANDED_WEIGHT_LBS", "EXVESSEL_REVENUE"))
Remove the columns where the vessel identifier (drvid) is either “UNKNOWN” or blank ("")
rawdat.sub %<>%
filter(drvid != "UNKNOWN") %>%
filter(drvid != "")
Adjust gear group codes for some uncategorized gears
rawdat.sub %<>%
mutate(gear_group=case_when(
gear_group=='MSC' & gear_name %in% c('SPEAR','DIVING - ABALONE IRON','DIVING - RAKE/HOOKS SEA URCHINS','DIVING', 'SHELLFISH DIVER') ~ 'DVG',
gear_group=='MSC' & gear_name %in% c('UNKNOWN','UNKNOWN OR UNSPECIFIED GEAR') ~ 'USP',
gear_group=='MSC' & gear_name %in% c('AQUACULTURE FARM','OYSTER FARM','CLAM FARM') ~ 'FRM',
TRUE ~ gear_group
))
Concatenate all species information for the fish ticket.
all.species <- rawdat.sub %>%
group_by(Rec_ID, removal_type_name) %>%
summarise(species_code_all = ifelse(length(unique(PACFIN_SPECIES_CODE)) > 1, paste(unique(PACFIN_SPECIES_CODE), collapse="/"), as.character(unique(PACFIN_SPECIES_CODE))))
Concatenate the gear information for the fish ticket
gear.info <- rawdat.sub %>%
group_by(Rec_ID, removal_type_name) %>%
summarise(gear_name_all = ifelse(length(unique(gear_name)) > 1, paste(unique(gear_name), collapse="/"), as.character(unique(gear_name))),
gear_code_all = ifelse(length(unique(gear_name)) > 1, paste(unique(gear_code), collapse="/"), as.character(unique(gear_code))))
We need to define the target species for each landed ticket. We will do this by finding the species with the greatest landed revenue for each trip.
Right now, each row of the data is a landing amount for a particular gear/ticket/species combo. We want to collapse these tickets in order to just have one row for each ticket, with an associated amount of landings for the TARGET species.
rawdat.sub %>%
count(Rec_ID) %>%
ggplot(aes(n,after_stat(count)))+
geom_histogram(fill="seagreen",bins=30)+
labs(x="Number of records per ticket",y="kernel density")
Calculate landed pounds and revenue by species for each fish ticket. Then use these data to define proportions of pounds and revenue by species/ticket, in order to designate a “target” species for each fishing trip. We denote the target species for both revenue and pounds landed. If a ticket does not have a definitive target (i.e., the leading species’ proportion of landed pounds or revenue does not exceed the second-place species by the target_cutoff ratio), denote “NONE” for the target.
pacfin_weight_metric <- sym(pacfin_weight_metric)
pacfin_revenue_metric <- sym(pacfin_revenue_metric)
rawdat.w.targets <- rawdat.sub %>%
distinct() %>%
# Group by ticket, removal type, and species
group_by(Rec_ID,removal_type_code,removal_type_name,PACFIN_SPECIES_CODE) %>%
# calculate landed pounds and revenue by species
summarise(spp_lbs=sum({{pacfin_weight_metric}}),
spp_revenue=sum({{pacfin_revenue_metric}})) %>%
ungroup() %>%
# now, calculate total pounds per species across the entire ticket
group_by(Rec_ID,PACFIN_SPECIES_CODE) %>%
mutate(tot_lbs_spp=sum(spp_lbs),
tot_revenue_spp=sum(spp_revenue)) %>%
ungroup() %>%
# using these species totals, calculate proportions of total catch belonging to each species
# by lbs landed and revenue
group_by(Rec_ID) %>%
mutate(prop_lbs_spp=tot_lbs_spp/sum(tot_lbs_spp),
prop_revenue_spp=tot_revenue_spp/sum(tot_revenue_spp)) %>%
# finally, assign a TARGET to the trip, defined as the species with the
# LARGEST proportion of revenue for that trip
# If a species landed is not >10% more than the second species, target is NONE
mutate(first_rev=dplyr::first(prop_revenue_spp,order_by = desc(prop_revenue_spp)),
second_rev=dplyr::nth(prop_revenue_spp,n=2,order_by = desc(prop_revenue_spp)),
first_rev_spp=dplyr::first(PACFIN_SPECIES_CODE,order_by= desc(prop_revenue_spp)),
second_rev_spp=dplyr::nth(PACFIN_SPECIES_CODE,n=2,order_by= desc(prop_revenue_spp)),
first_lbs=dplyr::first(prop_lbs_spp,order_by = desc(prop_lbs_spp)),
second_lbs=dplyr::nth(prop_lbs_spp,2,order_by = desc(prop_lbs_spp)),
first_lbs_spp=dplyr::first(PACFIN_SPECIES_CODE,order_by=desc(prop_lbs_spp)),
second_lbs_spp=dplyr::nth(PACFIN_SPECIES_CODE,n=2,order_by= desc(prop_lbs_spp))) %>%
# check if first is >10% more than second, for revenue and landed lbs
# or, if first and second species are the same (i.e. for a ticket with both commercial and personal use catch)
# if so, assign that species as TARGET
mutate(TARGET_rev=ifelse(is.na(first_rev/second_rev)|(first_rev/second_rev)>=target_cutoff|first_rev_spp==second_rev_spp,first_rev_spp,"NONE"),
TARGET_lbs=ifelse(is.na(first_lbs/second_lbs)|(first_lbs/second_lbs)>=target_cutoff|first_lbs_spp==second_lbs_spp,first_lbs_spp,"NONE")) %>%
ungroup() %>%
select(-(first_rev:second_lbs_spp))
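The top-two bookkeeping above relies on dplyr::first()/nth() with order_by; a toy example (made-up species and proportions) shows how they pick the leading and runner-up species:

```r
library(dplyr)

# Hypothetical per-species revenue proportions on one ticket.
prop <- c(0.15, 0.60, 0.25)
spp  <- c("SABL", "DCRB", "LCOD")

# Sorting by descending proportion: DCRB leads, LCOD is second.
first_spp  <- dplyr::first(spp, order_by = desc(prop))
second_spp <- dplyr::nth(spp, n = 2, order_by = desc(prop))
c(first_spp, second_spp)  # "DCRB" "LCOD"
```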
# Add back in dates, vessel IDs, etc.##Added FTID in this list
recID_attributes <- rawdat.sub %>%
select(Rec_ID,FTID,PACFIN_SPECIES_CODE,pacfin_port_code,port_group_code,gear_code,gear_name,gear_group,drvid,agency_code,date,month,year)
rawdat.w.targets %<>%
left_join(recID_attributes,by=c("Rec_ID","PACFIN_SPECIES_CODE"))
# add all species and gear types for each ticket
rawdat.w.targets %<>%
left_join(all.species) %>%
left_join(gear.info)
For each fish ticket, sum up an “Other” category based on the choice made at the beginning of the script (gear_codes)
dat_targets_other <- rawdat.w.targets %>%
distinct() %>%
group_by(Rec_ID) %>%
# denote whether a record is in the Other category
mutate(is_other=gear_name %in% gear_codes) %>%
# sum the other category for each ticket
mutate(other_rev=sum(tot_revenue_spp[is_other]),
other_lbs=sum(tot_lbs_spp[is_other]))
Here we add some species-specific columns based on the choice made at the beginning of the script (spp_codes).
# Use pivoting to define pounds and revenue for each species of interest
RecID_spp_cols <- dat_targets_other %>%
select(Rec_ID,PACFIN_SPECIES_CODE,tot_lbs_spp,tot_revenue_spp) %>%
filter(PACFIN_SPECIES_CODE %in% spp_codes) %>%
distinct() %>%
pivot_longer(tot_lbs_spp:tot_revenue_spp) %>%
mutate(type=ifelse(str_detect(name,'lbs'),'lbs','revenue')) %>%
select(-name) %>%
group_by(Rec_ID) %>%
# spread to create the columns
pivot_wider(names_from=c(PACFIN_SPECIES_CODE,type),values_from = value,values_fill = list(value=0))
# add back to main data
dat_targets_other %<>%
left_join(RecID_spp_cols)
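A minimal sketch of the same pivot on a hypothetical one-ticket, one-species table shows how the DCRB_lbs / DCRB_revenue columns arise:

```r
library(dplyr)
library(tidyr)
library(stringr)

# Made-up ticket: one Rec_ID, one species, duplicated rows that
# distinct() collapses before pivoting.
toy <- tibble(
  Rec_ID = c(1, 1),
  PACFIN_SPECIES_CODE = c("DCRB", "DCRB"),
  tot_lbs_spp = c(500, 500),
  tot_revenue_spp = c(1500, 1500)
) %>% distinct()

toy_wide <- toy %>%
  pivot_longer(tot_lbs_spp:tot_revenue_spp) %>%
  mutate(type = ifelse(str_detect(name, "lbs"), "lbs", "revenue")) %>%
  select(-name) %>%
  # names are glued as <species>_<type>, e.g. DCRB_lbs
  pivot_wider(names_from = c(PACFIN_SPECIES_CODE, type),
              values_from = value, values_fill = list(value = 0))

names(toy_wide)  # "Rec_ID" "DCRB_lbs" "DCRB_revenue"
```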
If the gear type is a trap or pot or ring, denote this
dat_targets_other %<>%
mutate(TRAP.OR.POT.OR.RING=str_detect(gear_name,"TRAP")|str_detect(gear_name,"POT")|str_detect(gear_name,"RING"))
Change existing date columns to date objects and format.
dat_targets_other %<>%
mutate(date=as.Date(date,"%d-%b-%y")) %>%
mutate(month=lubridate::month(date),
month=month.name[month],
year_mo=paste(year,lubridate::month(date),sep="_"),
jdate=yday(date),
year_jdate=paste(year,jdate,sep="_"),
Week=week(date),
year_Wk=ifelse(Week < 10, paste0(year,"_0",Week), paste0(year,"_",Week)))
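A quick check of the date parsing and week labeling (hypothetical date; the "%b" parse assumes an English locale for month abbreviations):

```r
library(lubridate)

# "05-Feb-17" parses under "%d-%b-%y"; Feb 5 is day-of-year 36,
# which falls in week 6, so the zero-padded label is "2017_06".
d  <- as.Date("05-Feb-17", "%d-%b-%y")
wk <- week(d)
year_Wk <- ifelse(wk < 10, paste0("2017_0", wk), paste0("2017_", wk))
year_Wk  # "2017_06"
```

An equivalent, arguably simpler way to zero-pad the week label would be sprintf("%d_%02d", year, Week).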
We add two columns that denote how many tickets were submitted by a
vessel in a given day. n_tix denotes the total number of
tickets, while n_nonpers_tix denotes the total number of
tickets NOT for personal use
dat_out <- dat_targets_other %>%
# boolean indicating whether a record is for personal use
mutate(is_personal=removal_type_name=="PERSONAL USE") %>%
# for each vessel and date
group_by(drvid,date) %>%
# calculate quantities of interest
mutate(n_tix=n_distinct(Rec_ID),
n_pers=n_distinct(Rec_ID[is_personal]),
n_nonpers_tix=n_tix-n_pers) %>%
select(-n_pers,-is_personal) %>%
ungroup()
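A toy vessel-day (hypothetical tickets, one of them personal use) illustrates the two counts:

```r
library(dplyr)

# Three tickets for one vessel on one day; ticket 103 is personal use,
# so n_tix = 3 and n_nonpers_tix = 2.
toy <- tibble(
  drvid  = "V1",
  date   = as.Date("2017-05-01"),
  Rec_ID = c(101, 102, 103),
  removal_type_name = c("COMMERCIAL", "COMMERCIAL", "PERSONAL USE")
)

counts <- toy %>%
  mutate(is_personal = removal_type_name == "PERSONAL USE") %>%
  group_by(drvid, date) %>%
  summarise(n_tix = n_distinct(Rec_ID),
            n_nonpers_tix = n_tix - n_distinct(Rec_ID[is_personal]),
            .groups = "drop")

counts$n_tix          # 3
counts$n_nonpers_tix  # 2
```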
Clean up the output by reordering the columns of interest
#Added FTID to select list
dat_out %<>%
select(-(spp_lbs:prop_revenue_spp),-is_other) %>%
select(Rec_ID,FTID,pacfin_port_code,port_group_code,drvid,agency_code,removal_type_code,removal_type_name,gear_code_all,TRAP.OR.POT.OR.RING,date,month,year,year_mo,jdate,year_jdate,Week,year_Wk,contains("_lbs"),contains("_rev"),species_code_all,other_lbs,other_rev) %>%
# the last distinct() removes records for which we no longer care about individual sub-trip species catches (we only care about catches of the focal species, plus total catch and TARGETs)
distinct()
Save the data!
#Save a separate version of output with FTID still included
yrs_of_data <- unique(dat_out$year)
for(y in 1:length(yrs_of_data)) {
dat_out %>%
filter(year==yrs_of_data[y]) %>%
write_rds(here::here('Confidential','processed','fish tickets',paste0(yrs_of_data[y],"fishtix_withFTID",".rds",collapse="")))
}
# for(y in unique(rawdat$LANDING_YEAR)){
# year.final.processdat <- filter(final.processdat, year == y )
# write_csv(file = paste0(processdir, "fish tickets ", y, " processed_multispecies03.csv"), x = year.final.processdat, row.names = FALSE)
# cat("wrote out file for ", y, "\n")
# }
x<-proc.time()-alltime
So far, this pipeline for 2017 VMS data has taken 5.57 minutes to run.
Calculate vessel lengths based on vessel registration data, and then add in a few columns to the fish ticket data providing vessel length for each ticket.
We use a decision tree for calculating vessel length from PacFIN data. The template allows for one year or multiple years of fish tickets to be run at the same time.
Currently, the code writes out processed vessel length keys and combined vessel length/fish ticket landings data as .rds files. Code to write to .csv is retained but commented out. Plotting of outputs is optional, controlled by the plot_output flag.
Step 1: Pull the registration data up to two years prior to the fish ticket year (3 years total)
Step 2: Remove any registration entries with vessels larger than 200 feet or smaller than 10 feet
Step 3: Find the number of unique vessel lengths for all unique Vessel Number / Agency Code combinations.
Step 4: Calculate final vessel length for all unique Vessel Number / Agency Code combinations
If vessel length data was available for 2+ years and MAX_LENGTH < MIN_LENGTH + 10, assume this reflects an actual increase or decrease in vessel size. Final vessel length is the mean of the two most recent vessel lengths.
If vessel length data was available for only one year, OR was available for 2+ years but MAX_LENGTH > MIN_LENGTH + 10, pull registration data up to four years prior to the fish ticket year (5 years total). Then, if…
Calculation of vessel lengths for X years of fish tickets.
Summary of calculation output: proportion of fishing vessels missing lengths (per year), distribution of vessel lengths (per year and overall), vessel length against the number of years vessel length was recorded (per year), and the frequency of the calculation types used (per year and overall).
Application of calculated vessel lengths to appropriate fish tickets.
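The first branch of Step 4 can be sketched as follows. This is a simplified illustration only, with made-up registration years and lengths; the real logic lives in the sourced calc_length() and get2yrmean() functions.

```r
# Toy registration history: three years of lengths that vary by < 10 ft,
# so the rule takes the mean of the two most recent lengths.
toy_lengths <- data.frame(
  REGISTRATION_YEAR = c(2015, 2016, 2017),
  VESSEL_LENGTH     = c(38, 40, 42)
)

stable <- nrow(toy_lengths) >= 2 &&
  max(toy_lengths$VESSEL_LENGTH) < min(toy_lengths$VESSEL_LENGTH) + 10

final_length <- if (stable) {
  # mean of the two most recent lengths
  recent <- toy_lengths[order(-toy_lengths$REGISTRATION_YEAR), ][1:2, ]
  mean(recent$VESSEL_LENGTH)
} else {
  NA_real_  # the real code falls back to a longer 5-year window here
}
final_length  # 41
```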
Resources:
PMSC Fleet Report, esp. Table 8
PacFIN column names key
Load packages
library(tidyverse)
library(here)
library(magrittr)
library(knitr)
# ggplot theme
plot_theme <- theme_minimal()+
theme(text=element_text(family="sans",size=12,color="black"),
legend.text = element_text(size=14),
axis.title=element_text(family="sans",size=14,color="black"),
axis.text=element_text(family="sans",size=8,color="black"),
panel.grid.major = element_line(color="gray50",linetype=3))
theme_set(plot_theme)
keepers <- c('process_year','alltime','spp_codes','gear_codes','pacfin_weight_metric','pacfin_revenue_metric','target_cutoff','lookback_window','year_begin','year_end','lookback_begin')
rm(list=setdiff(ls(),keepers))
options(dplyr.summarise.inform=F)
Source functions for vessel length calculations.
source(here::here('process steps',"Report_Vessel_Length_processedtix_functions.R"))
Enter processing options: Year desired to be processed, whether to only process Dungeness vessels, and whether to plot outputs.
# process_year <- 2020
plot_output <- TRUE
Load permits data
permits <- read_csv(here('Confidential','raw','vessel info','2009_2023_vesselreg.csv'))
Choose columns of interest and subset the data frame
pcols <- c("VESSEL_NUM", "AGENCY_CODE","REGISTRATION_YEAR","VESSEL_LENGTH", "LENGTH_TYPE_CODE")
pthin <- permits %>% select(all_of(pcols))
Change all entries of 777 for vessel length to
NA (based on recommendation from Blake Feist)
pthin %<>%
mutate(VESSEL_LENGTH=na_if(VESSEL_LENGTH,777))
Remove any vessel lengths greater than 200 feet or lower than 10 feet. Note: depending on the fishery, you may want to change these cutoffs
pthin_length_filter <- pthin %>%
filter(VESSEL_LENGTH < 200,VESSEL_LENGTH > 10)
Read in raw fish ticket .rds files for each year of
interest to data frame “landings”.
# import landings data
#Adjust import file name - use one that has FTID columns
landings <- purrr::map(process_year, function(y){
read_rds(here::here('Confidential','processed','fish tickets',paste0(y,'fishtix_withFTID.rds')))
}) %>%
bind_rows()
# find relevant vessels
vessels <- landings %>%
filter(drvid != "UNKNOWN", drvid != "MISSING", drvid != "", !is.na(drvid)) %>%
select(drvid, agency_code, year) %>%
distinct()
Initiate empty data frame for all vessel length data across years
vessel_length_key_df <- tibble("drvid" = character(),
"agency_code" = character(),
"year" = numeric(),
"FINAL_LENGTH" = numeric(),
"TYPE_CALC" = character(),
"UNIQUE_LENGTHS" = numeric(),
"N_YEARS_LENGTH_RECORDED" = numeric(),
"HISTORIC_DATA" = character())
Do the calculation of vessel lengths.
for(y in unique(vessels$year)){
# take only the vessels fishing in year "y"
year_vessels <- filter(vessels, year == y)
# identify the years of regulation data to pull
target_reg_years <- seq(y-2, y)
# subset for target registration years and summarise permits for each vessel
cat("Calculating 3yr summary statistics for vessels in ", y, "\n")
pthin_sumstats <- pthin_length_filter %>%
filter(REGISTRATION_YEAR %in% target_reg_years,!is.na(VESSEL_LENGTH)) %>%
group_by(VESSEL_NUM, AGENCY_CODE) %>%
arrange(desc(REGISTRATION_YEAR)) %>%
summarise(n_lengths = length(VESSEL_LENGTH),
n_unique = length(unique(VESSEL_LENGTH)),
max_length = max(VESSEL_LENGTH),
min_length = min(VESSEL_LENGTH),
mean2yr = get2yrmean(x=VESSEL_LENGTH, years=REGISTRATION_YEAR)) %>%
ungroup()
# create empty vectors for this year
final_vessel_lengths <- c()
length_calc_vec <- c()
n_unique_vec <- c()
n_lengths_vec <- c()
historic_vec <- c()
processed <- 0
cat("Calculating vessel lengths for fishing vessels in ", y, "...\n")
# for each vessel fishing in this year #
for(i in seq_along(year_vessels$drvid)){
## use the calc_length function (loaded from the "functions.R" file) to calculate vessel length
tmp_vessel_length_info <- calc_length(permits=permits, vesseldat = year_vessels, lengthdat = pthin_length_filter, summarydat = pthin_sumstats, index = i)
## save the calc_length output to the appropriate position ("i") in the output vectors for this year
n_lengths_vec[i] <- tmp_vessel_length_info[1] %>% as.numeric()
n_unique_vec[i] <- tmp_vessel_length_info[2] %>% as.numeric()
final_vessel_lengths[i] <- tmp_vessel_length_info[3] %>% as.numeric()
length_calc_vec[i] <- tmp_vessel_length_info[4]
## if the vessel had to be calculated with historical data from over 5 years ago, a warning message will be saved in the calc_length output
if(length(tmp_vessel_length_info) > 4){
### print the warning message
print(tmp_vessel_length_info[5])
### save "Y" to the historic_vec for this year
historic_vec[i] <- "Y"
} else{ historic_vec[i] <- "N" }
processed <- processed + 1
}
cat("Done processing", processed, "vessels for", y, "\n")
# save all of the output vectors to a data frame for this year
tmp_vessel_length_key_df <- tibble("drvid" = year_vessels$drvid,
"agency_code" = year_vessels$agency_code,
"year" = year_vessels$year,
"FINAL_LENGTH" = final_vessel_lengths,
"TYPE_CALC" = length_calc_vec,
"UNIQUE_LENGTHS" = n_unique_vec,
"N_YEARS_LENGTH_RECORDED" = n_lengths_vec,
"HISTORIC_DATA" = historic_vec)
## bind this year's data frame to the end of the full data frame
vessel_length_key_df <- bind_rows(vessel_length_key_df, tmp_vessel_length_key_df)
cat("Wrote out", dim(tmp_vessel_length_key_df)[1], "lengths for", y, " to final data frame\n\n")
}
## Calculating 3yr summary statistics for vessels in 2017
## Calculating vessel lengths for fishing vessels in 2017 ...
## [1] "ERROR: Vessel WRN00957 W COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WRN11117 W COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WN9205SB W COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WRN02096 W COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "WARNING: Length for vessel WN7260SH W had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel WN7260SH W had to be calculated with data > 4 years prior to 2017"
## [1] "ERROR: Vessel WRN00953 W COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WRN00978 W COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel OR399HX W COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel OR538EG O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel 696433 O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WN37515 O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WN0691RU O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel OR243FM O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WN4728LH O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WN18612 O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel OR047AFN O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel 952234 O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WN5449RR O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel OR994SG O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WN8204RU O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "WARNING: Length for vessel OR399HX O had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel OR399HX O had to be calculated with data > 4 years prior to 2017"
## [1] "ERROR: Vessel WN19130 O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel OR77AFC O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "WARNING: Length for vessel WN772CK O had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel WN772CK O had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel 534685 O had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel 534685 O had to be calculated with data > 4 years prior to 2017"
## [1] "ERROR: Vessel WN7503S O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel OR141HX O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WN6698NV O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel 978840 O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "WARNING: Length for vessel WN7644SF O had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel WN7644SF O had to be calculated with data > 4 years prior to 2017"
## [1] "ERROR: Vessel 946771 O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WN9202 O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel WN26129 O COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "WARNING: Length for vessel 536546 C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel 536546 C had to be calculated with data > 4 years prior to 2017"
## [1] "ERROR: Vessel CF8848NA C COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "WARNING: Length for vessel 1221588 C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel 1221588 C had to be calculated with data > 4 years prior to 2017"
## [1] "ERROR: Vessel CF5533SC C COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "WARNING: Length for vessel CF2679KS C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF2679KS C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel 1279881 C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel 1279881 C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF4231SJ C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF4231SJ C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF7068TX C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF7068TX C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel 1151438 C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel 1151438 C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF2060ZM C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF2060ZM C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF2159VH C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF2159VH C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF5919RA C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF5919RA C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel 977025 C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF1052KR C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel 611262 C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF5296SY C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF1566KA C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF4268KS C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF8939PK C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF5228UR C had to be calculated with data > 4 years prior to 2017"
## [1] "ERROR: Vessel CF2039HF C COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "WARNING: Length for vessel CF4486SA C had to be calculated with data > 4 years prior to 2017"
## [1] "ERROR: Vessel CF8026SS C COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "WARNING: Length for vessel CF2389KZ C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel AK9680AN C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel 579922 C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel 585848 C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF2227GM C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF4229UC C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF6493KN C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel 531799 C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel 1208783 C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF1148KL C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF3362UK C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF4485RM C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel 1181484 C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF8056UF C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF5596RG C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF7236KW C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF3416HG C had to be calculated with data > 4 years prior to 2017"
## [1] "ERROR: Vessel CF1012PM C COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel CF3582TT C COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel CF0281GS C COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "WARNING: Length for vessel CF9887KZ C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF4299UC C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF8696LE C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF9424UV C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel 1245219 C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF0754JF C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF9604RW C had to be calculated with data > 4 years prior to 2017"
## [1] "ERROR: Vessel CF7342SK C COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "WARNING: Length for vessel CF8508VH C had to be calculated with data > 4 years prior to 2017"
## [1] "ERROR: Vessel CF9572HE C COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "WARNING: Length for vessel CF7141TR C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel 1276159 C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF2459RD C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF8966RU C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF8803UB C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF2380SX C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF1266RJ C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF8772LE C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF5397JY C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF4489ER C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF4258NY C had to be calculated with data > 4 years prior to 2017"
## [1] "ERROR: Vessel CF9756ZV C COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "WARNING: Length for vessel CF9116HY C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF8038SC C had to be calculated with data > 4 years prior to 2017"
## [1] "ERROR: Vessel CF5004UF C COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "WARNING: Length for vessel CF9484JU C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel 1253177 C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF0267NK C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF0450VP C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel 1200264 C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel 1126960 C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF1265KJ C had to be calculated with data > 4 years prior to 2017"
## [1] "ERROR: Vessel CF5179RY C COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel CF7649TM C COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "ERROR: Vessel CF7906JB C COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "WARNING: Length for vessel CF4977TS C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF6122PW C had to be calculated with data > 4 years prior to 2017"
## [1] "ERROR: Vessel CF0154RB C COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "WARNING: Length for vessel 1276460 C had to be calculated with data > 4 years prior to 2017"
## [1] "WARNING: Length for vessel CF4596TU C had to be calculated with data > 4 years prior to 2017"
## [1] "ERROR: Vessel CF9676PF C COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "WARNING: Length for vessel 951387 C had to be calculated with data > 4 years prior to 2017"
## [1] "ERROR: Vessel CF3354FZ C COULD NOT BE FOUND IN PERMITS DATA BASE."
## [1] "WARNING: Length for vessel CF3334UK C had to be calculated with data > 4 years prior to 2017"
## Done processing 3658 vessels for 2017
## Wrote out 3658 lengths for 2017 to final data frame
How many fishing vessels WITH Vessel Numbers are missing calculated lengths?
cat(sum(is.na(vessel_length_key_df$FINAL_LENGTH)), 'vessels with Vessel ID numbers are missing calculated lengths, or', sum(is.na(vessel_length_key_df$FINAL_LENGTH)) / length(vessel_length_key_df$FINAL_LENGTH)*100, 'percent.')
## 45 vessels with Vessel ID numbers are missing calculated lengths, or 1.23018 percent.
for(y in unique(vessel_length_key_df$year)){
  for(a in unique(vessel_length_key_df$agency_code)){
    tmp_dat <- vessel_length_key_df %>%
      filter(agency_code == a) %>%
      filter(year == y)
    missing <- sum(is.na(tmp_dat$FINAL_LENGTH))
    cat("Number", y, "Vessels Missing Vessel Lengths for", a, ":", missing, "\n")
    cat("Proportion:", missing/length(tmp_dat$FINAL_LENGTH), "\n\n")
  }
}
## Number 2017 Vessels Missing Vessel Lengths for W : 7
## Proportion: 0.007376185
##
## Number 2017 Vessels Missing Vessel Lengths for O : 21
## Proportion: 0.02162719
##
## Number 2017 Vessels Missing Vessel Lengths for C : 17
## Proportion: 0.009781358
# ggplot(data=filter(vessel_length_key_df, !is.na(FINAL_LENGTH)), aes(x=as.numeric(FINAL_LENGTH),y=as.numeric(N_YEARS_LENGTH_RECORDED))) +
# geom_point()+
# facet_wrap(~year) +
# xlab("Calculated Vessel Length") +
# ylab("Number of Years Length Was Recorded") +
# theme(axis.text.x = element_blank())
# Frequency table of number of years' records
vessel_length_key_df %>%
count(N_YEARS_LENGTH_RECORDED) %>%
kable(col.names = c("Number of Records","Number of Vessels"))
| Number of Records | Number of Vessels |
|---|---|
| 1 | 258 |
| 2 | 1721 |
| 3 | 1600 |
| 4 | 25 |
| 5 | 5 |
| 6 | 1 |
| 7 | 1 |
| 8 | 1 |
| 9 | 1 |
| NA | 45 |
If the plotting option above (plot_output) is TRUE, make some informative plots.
if(plot_output){
  vessel_length_key_df %>%
    filter(!is.na(FINAL_LENGTH)) %>%
    ggplot(aes(as.numeric(FINAL_LENGTH))) +
    geom_histogram(bins=10) +
    facet_wrap(~year) +
    labs(x="Vessel Length",y="Number of Vessels",title="Vessel Length Frequency Distribution")
}
if(plot_output){
  vessel_length_key_df %>%
    ggplot(aes(TYPE_CALC)) + # refer to the column directly, not with $, so ggplot grouping/faceting works
    geom_bar() +
    facet_wrap(~year) +
    labs(x="Calculation Type",y="Number of Vessels")+
    theme(axis.text.x = element_text(angle=90, hjust=1))
}
For each year, join the vessel length data with the landings. How many fish tickets are missing a vessel length, by agency? (Counts are across all tickets, not species-specific.)
for(y in unique(vessel_length_key_df$year)){
  landings_subset <- landings %>%
    filter(year == y)
  dcrb_landings_length <- left_join(landings_subset, filter(vessel_length_key_df, year == y), by=c("drvid", "agency_code"))
  for(a in unique(dcrb_landings_length$agency_code)){
    tmp_dat <- filter(dcrb_landings_length, agency_code == a)
    missing <- sum(is.na(tmp_dat$FINAL_LENGTH))
    cat("Number of", y, "fish tickets missing vessel lengths for", a, ":", missing, "\n")
    cat("Proportion:", missing/length(tmp_dat$FINAL_LENGTH), "\n\n")
  }
}
## Number of 2017 fish tickets missing vessel lengths for W : 419
## Proportion: 0.02136229
##
## Number of 2017 fish tickets missing vessel lengths for O : 400
## Proportion: 0.01969958
##
## Number of 2017 fish tickets missing vessel lengths for C : 298
## Proportion: 0.005442624
Write out the vessel length key, one file per year.
for(y in unique(vessel_length_key_df$year)){
  key_subset <- vessel_length_key_df %>%
    filter(year == y)
  write_rds(key_subset,here::here('Confidential','processed','vessel length keys',paste0("vessel_length_key_", y, ".rds")))
}
Landings matched to vessel lengths
#save output with diff name to indicate FTID still in data
for(y in unique(vessel_length_key_df$year)){
  landings_subset <- landings %>%
    filter(year == y)
  vessel_length_key_df_subset <- vessel_length_key_df %>%
    filter(year==y) %>%
    select(-year)
  landings_length <- left_join(landings_subset,vessel_length_key_df_subset, by=c("drvid", "agency_code"))
  write_rds(landings_length,here::here('Confidential','processed','fish tickets',paste0(y, "fishtix_vlengths_withFTID.rds")))
}
x<-proc.time()-alltime
So far, this pipeline for 2017 VMS data has taken 6.08 minutes to run.
This step in data processing further cleans raw VMS data.
The following changes are made during further processing:

- Add Cartesian geocoordinates (after checking for missing lat/lon)
- Add unique record numbers for each VMS record
- Remove VMS records outside a given latitude range (assumed to be non-West Coast)
- Add the bathymetry layer
- Remove high-elevation points
- Reorder columns
To process the data, the user can choose the year and the lat/lon bounds of the VMS data points (currently set conservatively around US West Coast waters). Additionally, choose an upper bound for the bathymetry filter. It is conservatively set at +100 m elevation (the bathymetry values are in meters), which should certainly retain all ocean areas.
# process_year = 2021
lon_upper = -117 # if you do not want this filter, set to NA
lon_lower = -132 # if you do not want this filter, set to NA
lat_lower = 32 # if you do not want this filter, set to NA
lat_upper = 50 # if you do not want this filter, set to NA
bathy_upper = 100
# Updated 2/21/2024 with new VMS files
vms_fn <- list.files(here('Confidential','raw','vms'),full.names = T)
vms_we_want <- vms_fn[grepl("chunk",vms_fn)&grepl(process_year,vms_fn)]
vms_raw <- purrr::map_df(vms_we_want,read_rds) %>%
filter(year==process_year)
Are any records missing lat/lon?
missing_lat <- sum(is.na(vms_raw$LAT)); missing_lat
## [1] 0
# if(missing_lat > 0){
# View(vms_raw %>% filter(is.na(LAT)))
# }
After checking the original OLE file, delete the missing record(s).
vms_raw %<>% filter(!is.na(LAT))
Some of the latitude measurements are listed as negatives (likely due to the parsing process). Check how many observations this affects, and then change them to positive.
cat(sum(vms_raw$LAT<0),'records have negative latitude.')
## 39874 records have negative latitude.
vms_raw %<>% mutate(LAT = abs(LAT)) # flip negative latitudes to positive (same result as ifelse(LAT > 0, LAT, LAT*-1))
Convert lat/long to cartesian geocoordinates for UTM zone 10.
vms_coords <- vms_raw %>%
  # convert to simple features point object
  st_as_sf(coords=c("LON","LAT"),crs="+proj=longlat +datum=WGS84") %>%
  # project to UTM zone 10
  st_transform(crs = "+proj=utm +north +zone=10 +ellps=WGS84") %>%
  # extract XY coordinates
  st_coordinates()
# add to data frame
vms_raw %<>%
  mutate(X_COORD = vms_coords[,1],
         Y_COORD = vms_coords[,2])
# validate with back-conversion to lat/long for a random selection of 10 records
test_coords <- vms_raw %>%
sample_n(10) %>%
dplyr::select(LON, LAT, X_COORD,Y_COORD)
test_coordsLL <- test_coords %>%
st_as_sf(coords=c("X_COORD","Y_COORD"),crs="+proj=utm +north +zone=10 +ellps=WGS84") %>%
st_transform(crs="+proj=longlat +datum=WGS84") %>%
st_coordinates()
# test whether coordinates are the same to 4 decimal places
all(round(test_coords[,c(1,2)],digits = 4)==round(test_coordsLL,digits=4))
## [1] TRUE
Add unique ID numbers for VMS records.
VMS_RECNO: assign each record a unique integer ID (its row number offset by process_year × 10,000,000), then shuffle the IDs, sampling without replacement. CHECK PREVIOUS YEARS OF DATA TO MAKE SURE THERE ARE NO DUPLICATE RECORD NUMBERS
set.seed(0401) # set a RNG seed so that we have reproducibility if/when code is re-run
vms <- vms_raw %>%
ungroup() %>%
mutate(VMS_RECNO=row_number()+process_year*1e7) %>%
mutate(VMS_RECNO=sample(VMS_RECNO))
# Did we make any duplicates for this year (by accident)?
!length(unique(vms$VMS_RECNO))==nrow(vms)
## [1] FALSE
# Nope, all VMS_RECNO unique
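The cross-year uniqueness follows from the year offset: each processing year occupies its own block of ten million integers, so IDs from different years cannot collide as long as a year has fewer than 10 million records. A minimal sketch with made-up (hypothetical) ID sets for two years:

```r
# Toy check: each year's VMS_RECNO values live in a distinct block of 1e7 integers,
# so IDs from different processing years cannot collide (assuming < 1e7 records/year).
recno_2016 <- seq_len(5) + 2016 * 1e7   # made-up IDs for 2016
recno_2017 <- seq_len(5) + 2017 * 1e7   # made-up IDs for 2017
length(intersect(recno_2016, recno_2017))   # 0: no overlap between years
```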
This will delete records above the US-Canadian border (lat_upper) and below the US-Mexican border (lat_lower).
dim(vms)
## [1] 7178954 18
if(!is.na(lat_upper)){
  vms <- filter(vms, LAT < lat_upper)
}
if(!is.na(lat_lower)){
  vms <- filter(vms, LAT > lat_lower)
}
dim(vms)
## [1] 6033231 18
This will delete records far out to sea, or inland.
dim(vms)
## [1] 6033231 18
if(!is.na(lon_upper)){
  vms <- filter(vms, LON < lon_upper)
}
if(!is.na(lon_lower)){
  vms <- filter(vms, LON > lon_lower)
}
dim(vms)
## [1] 5975957 18
Duplicates are any VMS records with the same: UTC_TIME, LAT, LON, VESSEL_NAME, DOCNUM.
Create data frame where duplicated record (second record) is removed.
dim(vms)
## [1] 5975957 18
tm <- proc.time()
vms_nodup <- vms %>%
distinct(UTC_TIME,LAT,LON,VESSEL_NAME,DOCNUM,.keep_all = TRUE)
proc.time()-tm
## user system elapsed
## 1.230 0.133 1.364
cat("Proportion of VMS records removed for being true duplicate records:", 1-nrow(vms_nodup)/nrow(vms))
## Proportion of VMS records removed for being true duplicate records: 0.01046075
Save the duplicate entries to a file, to understand what data is being removed!
# janitor::get_dupes()
tm <- proc.time()
vms_dupes <- vms %>%
get_dupes(UTC_TIME,LAT,LON,VESSEL_NAME,DOCNUM) %>%
arrange(DOCNUM, UTC_TIME) %>%
dplyr::select(dupe_count,everything())
proc.time()-tm
## user system elapsed
## 42.145 1.756 43.914
write_rds(vms_dupes, here::here('Confidential','processed','vms',paste0(process_year, "_duplicates_only.rds")))
Read in the bathymetry object: Blake Feist’s 3 arc-second composite bathymetry
bathy.grid <- rast(here::here('data','bathymetry','composite_bath.tif'))
Get bathymetry at VMS data points
vms_sp <- vms_nodup %>% st_as_sf(coords=c("LON","LAT"),crs=4326)
vmsll <- st_coordinates(vms_sp)
bathy.points <- terra::extract(bathy.grid, vmsll)/10 # convert from decimeters to meters
vms_nodup <- mutate(vms_nodup, NGDC_M = bathy.points[,1])
Remove high-elevation (inland) points.
vms_nodup_bathy <- vms_nodup %>% filter(NGDC_M < bathy_upper)
cat('Filtering out records greater than',bathy_upper,'meters in elevation resulted in the removal of',nrow(vms_nodup)-nrow(vms_nodup_bathy),'records (',(nrow(vms_nodup)-nrow(vms_nodup_bathy))/nrow(vms_nodup)*100,'percent ).')
## Filtering out records greater than 100 meters in elevation resulted in the removal of 79245 records ( 1.340082 percent ).
Plot points. We plot a subset of 100 thousand VMS records to make plotting time reasonable.
vms_sf <- vms_nodup_bathy %>% st_as_sf(coords=c("X_COORD","Y_COORD"),crs = "+proj=utm +north +zone=10 +ellps=WGS84")
# Plotting takes a long time
# Try with a subset of 100k points
vms_sf_sample <- vms_sf %>%
sample_n(100000) %>%
filter(NGDC_M>-100000)
# coastline
coaststates <- ne_states(country='United States of America',returnclass = 'sf') %>%
filter(name %in% c('California','Oregon','Washington')) %>%
st_transform(st_crs(vms_sf))
ggplot()+
geom_sf(data=coaststates,fill='gray50',col=NA)+
geom_sf(data=vms_sf_sample,size=0.5,col='blue')+
labs(x='Longitude',y='Latitude',title=paste0(process_year," VMS records"))
vms_ordered <- vms_nodup_bathy %>%
dplyr::select(UTCDATETIME,LAT,LON,NGDC_M,VESSEL_NAME,AVG_SPEED,AVG_COURSE,DOCNUM,DECLARATION_CODE,X_COORD,Y_COORD,VMS_RECNO)
write_rds(vms_ordered,here::here('Confidential','processed','vms',paste0(process_year,'_vms_clean.rds')))
x<-proc.time()-alltime
So far, this pipeline for 2017 VMS data has taken 8.35 minutes to run.
This portion of the data processing matches the cleaned PacFIN fish tickets to the cleaned VMS data. We match the two datasets together using unique vessel identification numbers that are common to both datasets. We define a fishing “trip” as the collection of VMS observations directly preceding a landed fish ticket. Hence, all of the VMS pings that occur between two tickets are considered part of the latter ticket’s trip. We enforce a “lookback window” such that if there are more than X days [an option that can be changed] between successive tickets for the same vessel, we only include data for those X days. Look for the lookback window option in the master process script.
Later, the data are filtered and cleaned for depth, unrealistic speeds between pings, and corrected for multiple-ticket days.
Choose year to process. (this is usually done in the top level script)
# process_year <- 2021
Load fish ticket data (processed), including the last month of the previous year (for handling trips spanning January 1st of the current year)
#Load in data with FTID and vessel length
fishtix <- read_rds(here::here('Confidential','processed','fish tickets',paste0(process_year,'fishtix_vlengths_withFTID.rds')))
#Note: some earlier years do not yet have the length and FTID columns, because that step has not been run for them
if(process_year>2009){
  fishtix_prev <- read_rds(here::here('Confidential','processed','fish tickets',paste0(process_year-1,'fishtix_withFTID.rds'))) %>%
    filter(month(date)==max(month(date)))
  fishtix <- bind_rows(fishtix_prev,fishtix)
  rm(fishtix_prev)
}
Load VMS data, including the last month of the previous year (for handling trips spanning January 1st of the current year)
vms <- read_rds(here::here('Confidential','processed','vms',paste0(process_year,'_vms_clean.rds')))
if(process_year>2009){
  vms_prev <- read_rds(here::here('Confidential','processed','vms',paste0(process_year-1,'_vms_clean.rds'))) %>%
    filter(month(UTCDATETIME)==max(month(UTCDATETIME)))
  vms <- bind_rows(vms_prev,vms)
  rm(vms_prev)
}
PacFIN reports in west coast time, so create two columns with date/time in Pacific time zone
vms %<>%
# with time (hour, minute)
mutate(westcoastdate = with_tz(UTCDATETIME, tzone = "America/Los_Angeles")) %>%
# without time
mutate(westcoastdate_notime = as_date(westcoastdate))
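Because Pacific time is 7-8 hours behind UTC, this conversion can shift a ping to the previous calendar day, which is why the local-date column matters when matching pings to tickets. A base-R sketch of the same conversion, using a hypothetical timestamp (lubridate::with_tz does the equivalent):

```r
# A UTC timestamp early on June 15 falls on June 14 in Pacific time (UTC-7 in summer)
utc <- as.POSIXct("2017-06-15 03:00:00", tz = "UTC")
format(utc, tz = "America/Los_Angeles", usetz = TRUE)   # "2017-06-14 20:00:00 PDT"
```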
Truncate fish ticket data to match VMS year, with the lookback window where appropriate. This should prevent fish tickets from the previous year being assigned the same VMS data points.
fishtix %<>%
filter(date >= lookback_begin & date <= year_end) %>%
#remove unneeded date columns
#Edited to keep FTID and final_length columns
select(Rec_ID:date,TARGET_lbs:FINAL_LENGTH)
Add begin and end dates for each ticket to use in VMS matching. These define the windows of time in which we “search” for matching VMS pings. We also add a column for trip duration in days, for use in QA later.
ptm <- proc.time()
fishtix %<>%
  # group by vessel
  group_by(drvid) %>%
  # add month
  mutate(month=month(date)) %>%
  # add a time for the end of day
  mutate(end = ymd_hms(paste(date, "23:59:59"),tz= "America/Los_Angeles")) %>%
  # arrange(date) %>%
  # find the ticket BEFORE each ticket and note its time. Instead of creating NAs for the first
  # ticket of the year for each vessel, assign a default value of the first ticket minus the lookback window.
  mutate(prev_ticket_dttm=lag(end,order_by=end,default=first(end,order_by = end)-days(lookback_window))) %>%
  # add lookback window
  mutate(lookback_dttm=end-days(lookback_window)) %>%
  # choose "begin" date for each ticket as the LATEST in time between previous ticket date and lookback date
  mutate(begin=ifelse(lookback_dttm>prev_ticket_dttm,lookback_dttm,prev_ticket_dttm) %>% as_datetime(tz= "America/Los_Angeles")) %>%
  # add trip duration
  mutate(trip_dur=interval(begin,end)) %>%
  mutate(trip_dur=time_length(trip_dur,"days")) %>%
  # select(-prev_ticket_dttm,-lookback_dttm) %>%
  ungroup()
x<-proc.time()-ptm
cat('Defining lookback window for all data took',x[3]/60,'minutes to run.')
## Defining lookback window for all data took 0.57695 minutes to run.
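The begin-date rule above can be illustrated with a toy base-R sketch (hypothetical ticket dates for a single vessel; the pipeline uses lubridate datetimes, but the logic is the same): each trip begins at the later of the previous ticket's end and this ticket's end minus the lookback window.

```r
lookback_window <- 7   # days, matching the pipeline default
end  <- as.Date(c("2017-01-10", "2017-01-12", "2017-01-25"))  # made-up ticket dates
# previous ticket's end; the first ticket defaults to its own end minus the window
prev <- c(end[1] - lookback_window, end[-length(end)])
# begin = the LATER of the previous ticket's end and the lookback limit
begin <- pmax(prev, end - lookback_window)
# tickets 1 and 2 are 2 days apart, so trip 2's window is capped at 2 days;
# ticket 3 comes 13 days later, so its window is capped at the full 7 days
as.numeric(end - begin)   # trip durations in days: 7 2 7
```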
Here is where we do the actual matching of VMS data to fish tickets. Matching is done by vessel ID number and time, which are recorded in both datasets. We first join all VMS data for each vessel, then filter such that only the records within the appropriate window of time for each ticket (defined in the previous step) are retained.
ptm <- proc.time()
join_vms_fishtix <- function(mth){
  tmptix <- fishtix %>%
    filter(month==mth)
  tmpvms <- vms %>% filter(month(westcoastdate) %in% c(mth-1,mth))
  tmptix %>%
    # Left-join the VMS data by Vessel ID (drvid/DOCNUM)
    left_join(tmpvms,by=c('drvid'='DOCNUM')) %>%
    # filter the VMS data such that the data for each trip is correctly truncated to fall within the lookback window.
    # If any tickets did NOT match VMS data, they will not have associated dates from VMS.
    # We retain these, and then produce a boolean indicator denoting that they did not match.
    filter((westcoastdate >= begin & westcoastdate <= end) | is.na(westcoastdate)) %>%
    mutate(has_vms=ifelse(is.na(westcoastdate),0,1)) %>%
    ungroup()
}
fishtix_vms <- purrr::map_df(unique(fishtix$month),~join_vms_fishtix(.)) %>%
  # finally, filter the edge case such that we remove trips ENDING before January 1st.
  # this will retain the VMS records that are before Jan 1 for any trips ending on or after Jan 1
  filter(year(date)==process_year)
x<-proc.time()-ptm
cat('Joining VMS data and filtering took',x[3]/60,'minutes to run for',process_year)
## Joining VMS data and filtering took 0.4221833 minutes to run for 2017
We now have the full matched dataset, which retains 69045 fish ticket records with no associated VMS data.
# How long are trips?
fishtix_vms %>%
select(Rec_ID,trip_dur,has_vms) %>%
mutate(trip_dur=as.integer(trip_dur)) %>%
distinct() %>%
ungroup() %>%
mutate(has_vms=ifelse(has_vms,"VMS Trips","Non-VMS Trips")) %>%
ggplot(aes(trip_dur,fill=has_vms))+
geom_bar(position = 'dodge')+
scale_x_discrete(limits=seq(0,7,by=1))+
labs(x="Trip Duration (Days)",y="Number of Trips",title="Trip Duration",fill="")
# Count how many trips have multiple species (measured as multiple targets)
fishtix_vms %<>% mutate(multispecies=ifelse(nchar(species_code_all)>4,1,0))
fishtix_vms %>%
select(Rec_ID,multispecies,has_vms) %>%
distinct() %>%
mutate(has_vms=ifelse(has_vms,"VMS Trips","Non-VMS Trips")) %>%
ggplot(aes(factor(multispecies),fill=factor(multispecies)))+
scale_fill_discrete(labels=c('No',"Yes"))+
geom_bar()+
facet_wrap(~has_vms)+
labs(x="",y="Number of Fish Tickets",title="Number of Multispecies Trips",fill="Multispecies\nTrip")+
theme(axis.text.x = element_blank())
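The multispecies flag above relies on PacFIN species codes being four characters long, so any species_code_all string longer than four characters implies more than one species. A minimal sketch with made-up code strings (the exact separator used in species_code_all does not matter for this test):

```r
# a single 4-character code vs. a concatenation of two codes
codes <- c("DCRB", "DCRB/SABL")
ifelse(nchar(codes) > 4, 1, 0)   # 0 1: the second ticket is multispecies
```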
# Distribution of number of trips per vessel, VMS vs. non-VMS records
fishtix_vms %>%
group_by(drvid,has_vms) %>%
summarise(ntix=length(unique(Rec_ID))) %>%
ungroup() %>%
mutate(has_vms=ifelse(has_vms,"VMS Vessels","Non-VMS Vessels")) %>%
ggplot(aes(ntix,after_stat(count)))+
geom_histogram(bins=10)+
facet_wrap(~has_vms)+
labs(x=paste("Number of Trips in",process_year),y="Number of Vessels",title="Distribution of Number of Recorded Trips per Vessel")
# adjust for inflation based on reference year using Fred deflators (added 8May2024)
# left_join(gdp_defl, by = YEAR) %>% #left_join(gdp_defl, by = c("year in fishtix_vms name" = "YEAR")
# mutate(AFI_REVENUE = EXVESSEL_REVENUE / DEFL)
Save a version that has all fish ticket data, including records with no VMS match, as well as a filtered dataset that excludes non-VMS records.
fishtix_vms %<>% select(-month)
fishtix_vms_all <- fishtix_vms
fishtix_vms_only <- fishtix_vms %>% filter(has_vms==1)
#Save versions that have FTID and length
write_rds(fishtix_vms_all,here::here('Confidential','processed','matched','matching',paste0(process_year,'matched_alltix_withFTID.rds')))
write_rds(fishtix_vms_only,here::here('Confidential','processed','matched','matching',paste0(process_year,'matched_vmstix_only_withFTID.rds')))
x<-proc.time()-alltime
# cat('This step for',process_year,'VMS data took',round(x[3]/60,2),'minutes to run.')
So far, this pipeline for 2017 VMS data has taken 9.88 minutes to run.
This final step cleans up the matched fish ticket/VMS records spatially through filtering. The script:

- Removes outlier trips where the Port of Landing is > 50 km from the last VMS point of the associated trip
- Recalculates average speed for each trip and filters out segments with unrealistically large speeds. The speed cutoff defaults to 20 m/s, but can be altered in this script.
- Flags VMS points that have a bathymetry value > 0
- Flags VMS points that are within a 3-kilometer buffer of a port, so we can remove points that are not directly associated with a fishing trip (e.g., because the boat is sitting in port between trips)
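The speed screen in the second step can be sketched as follows, with hypothetical ping times and along-track distances (the real script computes distances between successive ping coordinates):

```r
speed_cutoff <- 20   # m/s, the default mentioned above
t <- as.POSIXct(c("2017-01-01 00:00:00", "2017-01-01 01:00:00", "2017-01-01 01:30:00"),
                tz = "UTC")
dist_m <- c(36000, 164000)   # made-up distances between successive pings, in meters
seg_speed <- dist_m / as.numeric(diff(t), units = "secs")   # 10 and ~91.1 m/s
seg_speed > speed_cutoff     # FALSE TRUE: the second segment would be filtered out
```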
First, read in the matched VMS/fish ticket data set for 2017.
vms <- read_rds(here::here('Confidential','processed','matched','matching',paste0(process_year,'matched_vmstix_only_withFTID.rds')))
Latitude/longitude of ports. Coordinates were provided by Blake Feist.
portlist_coords <- read_csv(here::here('data','port_coords_fromBlake_edited.csv'),col_types='cddd') %>%
select(port_code,Lon,Lat) %>%
set_names(c('port_code','portlon','portlat'))
Filter VMS data to include only the last VMS data point for each trip
fishtix_lastVMS <- vms %>%
group_by(Rec_ID) %>%
slice_max(westcoastdate, n=1) %>% # latest ping per trip (slice_max supersedes top_n)
ungroup()
Add port locations to last VMS points
fishtix_lastVMS <- left_join(fishtix_lastVMS, portlist_coords, by=c("pacfin_port_code" = "port_code"))
How many port lat/lon coordinates are NA? For 2017, 1.95 percent of trips are missing coordinates. We remove these observations.
Note: it is not yet clear why these port codes are missing coordinates.
fishtix_lastVMS %<>% filter(!is.na(portlat),!is.na(portlon))
Calculate distance with geosphere
port_dists <- geosphere::distHaversine(p1=cbind(fishtix_lastVMS$portlon, fishtix_lastVMS$portlat),
p2=cbind(fishtix_lastVMS$LON, fishtix_lastVMS$LAT))
Add the distances as another column
fishtix_lastVMS %<>%
mutate(port_to_VMS = port_dists / 1000)
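The distHaversine-to-kilometers conversion can be sanity-checked with toy coordinates (made up for illustration, not pipeline data):

```r
library(geosphere)
# Toy coordinates (not pipeline data): two points one degree of latitude apart
d_m <- distHaversine(p1 = c(-124, 45), p2 = c(-124, 46))
d_km <- d_m / 1000  # same meters-to-km conversion as port_to_VMS above
# One degree of latitude is roughly 111 km, so the 50 km cutoff used below
# corresponds to a bit under half a degree of latitude
```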
trips_keep_remove_portdist <- fishtix_lastVMS %>%
mutate(keep_remove_portdist= ifelse(port_to_VMS <= 50, 'keep','remove'))
trips_to_keep <- trips_keep_remove_portdist %>%
filter(keep_remove_portdist=='keep')
trips_to_remove <- trips_keep_remove_portdist %>%
filter(keep_remove_portdist=='remove')
For 2017, if we filter using this criterion, we retain 18275 trips, and remove 984 trips, which is about 5.11 percent of all trips.
retained_trips_plot <- trips_to_keep %>%
select(Rec_ID, pacfin_port_code) %>%
distinct() %>%
ggplot(aes(x=pacfin_port_code)) +
geom_bar() +
labs(x="Port",y="Number of Unique Trips",title="Retained Trips Ending <= 50km from a Port")+
theme_minimal()+
theme(axis.text.x = element_text(angle=90,vjust=0.5))
retained_trips_plot
trips_removed_plot <- trips_to_remove %>%
select(Rec_ID, pacfin_port_code) %>%
distinct() %>%
ggplot(aes(x=pacfin_port_code)) +
geom_bar() +
labs(x="Port",y="Number of Trips Removed",title="Removed Trips Ending > 50km from a Port")+
theme_minimal()+
theme(axis.text.x = element_text(angle=90,vjust=0.5))
trips_removed_plot
Sort the VMS and fish ticket data by the output of this filtering exercise.
Default behavior for calculations that return NA is to keep the record.
vms_wfilters <- vms %>%
left_join(trips_keep_remove_portdist %>% select(Rec_ID,keep_remove_portdist),by='Rec_ID') %>%
distinct(VMS_RECNO,UTCDATETIME,.keep_all = T) %>%
mutate(keep_remove_portdist=replace_na(keep_remove_portdist,'keep'))
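The replace_na() default is worth seeing in isolation. A minimal sketch with made-up trip IDs (not pipeline data):

```r
library(dplyr)
library(tidyr)
# Made-up trips: Rec_ID 2 was absent from the filter table, so the join left NA
flags <- tibble(Rec_ID = 1:3,
                keep_remove_portdist = c("keep", NA, "remove"))
flags <- flags %>%
  mutate(keep_remove_portdist = replace_na(keep_remove_portdist, "keep"))
# Rec_ID 2 is now "keep": records we could not evaluate are retained by default
```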
We calculate speeds by lagging observations (grouped by vessel) by 1 time step, then calculating distance and dividing by time. Note, then, that these are looking backwards, not forwards. In other words, the distance and time duration for a given VMS ping are the values associated with the segment ending at that ping.
vms_speeds <- vms_wfilters %>%
ungroup() %>%
group_by(drvid) %>%
# lag latitude and longitude by 1 time step
mutate(laglon=lag(LON,1,order_by=westcoastdate),laglat=lag(LAT,1,order_by=westcoastdate)) %>%
# lag time by 1 time step
mutate(lagdate=lag(westcoastdate,1,order_by=westcoastdate)) %>%
# calculate duration since last ping, in seconds
mutate(segment_dur=as.duration(lagdate %--% westcoastdate)/dseconds()) %>%
ungroup()
# Calculate distance (Note: geosphere seems much faster than doing this with sf())
segment_dists <- geosphere::distHaversine(p1=cbind(vms_speeds$LON, vms_speeds$LAT),
p2=cbind(vms_speeds$laglon, vms_speeds$laglat))
vms_speeds %<>%
mutate(segment_dist=segment_dists)
# Speed is just segment distance (default in meters) divided by segment duration (in seconds)
vms_speeds %<>%
mutate(avg_speed_recalc=segment_dist/segment_dur) %>%
# some calculations will be NaN or Inf because of 0 distance or time. Fix these as zeroes
mutate(avg_speed_recalc=ifelse(segment_dist==0|segment_dur==0,0,avg_speed_recalc))
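The backward-looking speed calculation can be traced end-to-end on a toy trajectory (hypothetical pings, not pipeline data):

```r
library(dplyr)
library(lubridate)
# Toy trajectory (not pipeline data): three hourly pings for one vessel
toy <- tibble(
  drvid = "V1",
  westcoastdate = ymd_hms(c("2017-01-01 00:00:00",
                            "2017-01-01 01:00:00",
                            "2017-01-01 02:00:00")),
  LON = c(-124.0, -124.1, -124.1),
  LAT = c(45.0, 45.0, 45.0)
)
toy %>%
  group_by(drvid) %>%
  mutate(laglon = lag(LON, 1, order_by = westcoastdate),
         laglat = lag(LAT, 1, order_by = westcoastdate),
         lagdate = lag(westcoastdate, 1, order_by = westcoastdate),
         segment_dur = as.duration(lagdate %--% westcoastdate) / dseconds()) %>%
  ungroup() %>%
  mutate(segment_dist = geosphere::distHaversine(cbind(LON, LAT),
                                                 cbind(laglon, laglat)),
         avg_speed_recalc = ifelse(segment_dist == 0 | segment_dur == 0, 0,
                                   segment_dist / segment_dur))
# The first ping has NA speed (no preceding segment, so the backward-looking
# calculation is undefined); the third ping did not move, so its speed is 0
```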
We compare the recalculated speeds to the reported AVG_SPEED using a subsample of 100,000 records.
set.seed(0401)
vms_subsample <- vms_speeds %>%
sample_n(100000) %>%
mutate(speed_diff=avg_speed_recalc-AVG_SPEED)
vms_subsample %>%
ggplot(aes(AVG_SPEED,avg_speed_recalc))+
geom_point()+
geom_smooth(method='lm')+
coord_equal(xlim=c(0,500),ylim=c(0,500))+
labs(x='Original Average Speed',y="Average Speed Recalculated")
Zoom in a bit more
vms_subsample %>%
ggplot(aes(AVG_SPEED,avg_speed_recalc))+
geom_point()+
geom_smooth(method='lm')+
coord_equal(xlim=c(0,50),ylim=c(0,50))+
labs(x='Original Average Speed',y="Average Speed Recalculated")
Distribution of differences
vms_subsample %>%
mutate(speed_diff=avg_speed_recalc-AVG_SPEED) %>%
ggplot(aes(speed_diff))+
geom_density(fill='seagreen',col=NA)+
geom_vline(xintercept=0,linetype=2)+
labs(x="Calculated minus Reported Speed (m/s)",y='kernel density')+
xlim(-5,5)
82.39 percent of our re-calculations are within 1 m/s of the reported average speeds. Beyond that, the reported average speeds tend to be greater than our calculated speeds.
max_speed <- 20 # meters per second
This time, indicate whether our recalculated speed is greater than 20 m/s. Add these indicators to the master, matched dataset. Default behavior when a calculation returns NA is to keep the record.
vms_wfilters <- vms_speeds %>% mutate(keep_remove_speed=ifelse(avg_speed_recalc>max_speed,'remove','keep')) %>%
mutate(keep_remove_speed=replace_na(keep_remove_speed,"keep"))
If we filter using the speed criterion, we would remove 0.04 percent of all observations for 2017.
Add another keep_remove indicator showing whether a given VMS ping may be on land.
vms_wfilters %<>% mutate(keep_remove_bathy=ifelse(NGDC_M<=0,"keep","remove"))
If we filter using the on-land criterion, we would remove 20.66 percent of all observations for 2017.
In this piece, we flag all “in-port” VMS records, using a buffer zone of 1.5 km or 3 km, depending on the port. In-port records that also have an average speed of < 1 m/s are marked for removal. Of these, only the records flagged before and after the fishing trip are actually removed from the filtered output data; flagged records in the middle of a trip are retained.
## width of buffer circle (in meters). applies to all ports but those that require smaller buffers
r = 3000
## port codes for any ports that require reduced buffer sizes. Default is only COS
ports_lowbuffer <- c("COS")
## width of buffer circle (in meters) for reduced buffer size
r2 = 1500
## cutoff value for speed (m/s) -- 1m/s ~ 2 knots
speed_cutoff <- 1
Using the port coordinates from above, indicate whether each port is in the “smaller buffer” category.
portlist_coords %<>%
mutate(small_buffer=ifelse(port_code %in% ports_lowbuffer,1,0) %>% as.logical())
# convert ports to sf() object and buffer
ports_sf <- portlist_coords %>%
st_as_sf(coords=c('portlon','portlat'),crs=4326) %>%
#project to UTM zone 10
st_transform(crs = "+proj=utm +north +zone=10 +ellps=WGS84")
# buffer large ports
large_ports_buffer <- ports_sf %>%
filter(!small_buffer) %>%
st_buffer(r)
# buffer small ports
small_ports_buffer <- ports_sf %>%
filter(small_buffer) %>%
st_buffer(r2)
ports_buffer <- rbind(large_ports_buffer,small_ports_buffer)
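A minimal sketch of why the ports are projected before buffering (hypothetical point, not one of the pipeline ports): st_buffer() interprets the radius in the units of the CRS, so buffering in UTM gives a true 3 km circle, while buffering in EPSG:4326 would treat 3000 as degrees.

```r
library(sf)
# Hypothetical harbor location (not one of the pipeline ports)
pt_ll  <- st_sfc(st_point(c(-124.0, 38.3)), crs = 4326)
# Project to UTM zone 10 so distances are in meters, as in ports_sf above
pt_utm <- st_transform(pt_ll, "+proj=utm +north +zone=10 +ellps=WGS84")
buf <- st_buffer(pt_utm, 3000)  # a 3 km circle, matching r above
```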
# import a background/land layer from rnaturalearth package
coaststates <- ne_states(country='United States of America',returnclass = 'sf') %>%
filter(name %in% c('California','Oregon','Washington')) %>%
# make sure CRS is the same as the port layer
st_transform(st_crs(ports_buffer))
# sections of coastline
cutcoast <- list(c(-124, -125, 47.5, 48.5), #upper WA
c(-123.5, -124.5, 46.7, 47.7), #lower WA p1
c(-123.5, -124.5, 45.7, 46.7), #lower WA / upper OR [WLB, NHL]
c(-123.5, -124.5,45.6, 44.6), # OR [TLL, NEW]
c(-123.5, -124.5,44.5, 43.5), # OR [WLD, WIN]
c(-124, -125, 43.5, 42.25), # OR [COS,GLD]
c(-123.75, -124.75, 42.2, 40.5), #OR to CA [BRK, FLN]
c(-123, -124, 40, 38.5), # CA [BRG, ARE]
c(-122, -123.5, 38.5, 36.6), # CA [BDG, CRZ]
c(-120.5, -122, 37, 35), # CA [MOS, AVL]
c(-117, -120, 35, 32.5)) # CA[SB, OCN]
# Plot
buffer_plots <- purrr::map(1:length(cutcoast), function(i){
seg=cutcoast[[i]]
bx=c(xmin=seg[2],xmax=seg[1],ymin=seg[3],ymax=seg[4])%>% st_bbox(crs=st_crs(4326))
bbox <- bx %>% st_as_sfc() %>% st_transform(st_crs(ports_buffer)) %>% st_bbox()
plotout <- ggplot()+
geom_sf(data=coaststates,fill='gray50')+
geom_sf(data=ports_buffer,aes(fill=small_buffer),alpha=0.5)+
geom_sf_text(data=ports_buffer,aes(label=port_code),size=3,hjust=1)+
xlim(bbox[1],bbox[3])+ylim(bbox[2],bbox[4])+
labs(x='',y='',title="Port Buffers",fill="Small Buffer")+
theme(axis.text.x = element_text(angle=90))
print(plotout)
})
Calculate whether points overlap buffers
vms_sf <- vms_wfilters %>%
select(VMS_RECNO,X_COORD,Y_COORD) %>%
st_as_sf(coords=c("X_COORD","Y_COORD"),crs = "+proj=utm +north +zone=10 +ellps=WGS84")
# do the overlay, which results in a list of buffer indices into which a VMS point falls
#https://github.com/r-spatial/sf/wiki/migrating
port_names <- ports_buffer$port_code
# vms_ports_over_list <- st_intersects(vms_sf,ports_buffer)
vms_ports_over <- sapply(st_intersects(vms_sf,ports_buffer), function(z) if (length(z)==0) NA_integer_ else z[1])
# Overlay returned integers. Pull out the appropriate port names from the list of ports
vms_ports_over_names <- port_names[vms_ports_over]
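st_intersects() returns a sparse list with one integer vector per point, which is why the sapply() step above is needed. A toy illustration with unit squares (not pipeline geometry):

```r
library(sf)
# Two unit squares and two points; no CRS needed for a planar toy example
polys <- st_sfc(
  st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)))),
  st_polygon(list(rbind(c(2, 0), c(3, 0), c(3, 1), c(2, 1), c(2, 0))))
)
pts <- st_sfc(st_point(c(0.5, 0.5)), st_point(c(5, 5)))
hits <- st_intersects(pts, polys)  # sparse list: integer(0) where no overlap
idx <- sapply(hits, function(z) if (length(z) == 0) NA_integer_ else z[1])
idx  # 1 for the point inside the first square, NA for the far-away point
```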
Add the port names back to the master dataset for each VMS point.
vms_wfilters %<>%
mutate(in_port=vms_ports_over_names)
## Plot a sample of VMS points
vms_sf %<>%
mutate(in_port=vms_ports_over_names)
test_port<-vms_sf %>%
filter(in_port=='BDG') %>%
#subsample for quicker plotting
sample_n(.,min(nrow(.),1000))
testbbox <- st_bbox(ports_buffer %>% filter(port_code=='BDG'))
# testbbox <- st_bbox(test_port)
ggplot()+
geom_sf(data=coaststates,fill='gray50')+
geom_sf(data=ports_buffer,fill='seagreen',alpha=0.5)+
geom_sf(data=test_port,size=1)+
xlim(testbbox[1],testbbox[3])+ylim(testbbox[2],testbbox[4])+
labs(x='',y='',title="Points within Bodega Buffer")+
theme(axis.text.x = element_text(angle=90))
What is the avg speed when vessels are marked as in port?
inport_dat <- vms_wfilters %>% mutate(in_port_binary = ifelse(is.na(in_port), FALSE,TRUE))
ggplot(data=filter(inport_dat, avg_speed_recalc < 50), aes(x=avg_speed_recalc,fill=factor(in_port_binary))) +
geom_histogram() +
facet_wrap(~in_port_binary)+
labs(x="Average Speed (Calculated)",y="Number of Records",fill="In Port")
Mark “remove” if the record is in port and the avg speed is < 1 m/s. For this filter, we use our recalculated speeds from the previous step.
vms_wfilters %<>%
mutate(keep_remove_inport = ifelse(!is.na(in_port) & avg_speed_recalc < speed_cutoff, "remove", "keep"))
round(sum(vms_wfilters$keep_remove_inport=='remove',na.rm=T)/nrow(vms_wfilters)*100,2)
## [1] 66.28
If we filter using the in-port criterion, we would remove 66.28 percent of all observations for 2017.
We want to remove all ‘remove’ points except the last in-port record before the fishing trip and the first in-port record after the fishing trip. For now, we will retain all mid-trip “remove” points.
We find these records for each trip by sorting on date and on whether the vessel is in port. We can then identify all of the records that fall before or after the fishing trip. We add a final keep_remove_ column that indicates whether a VMS ping is truly part of a “trip” according to the criterion described here. As a default, we retain pings: if a trip does not have a first or last in-port indicator, its pings are marked as in-trip.
vms_wfilters %<>%
# organize by trip and date
group_by(Rec_ID) %>%
arrange(westcoastdate) %>%
# indicate the dates of the first and last non-in-port records
mutate(first_keep=first(westcoastdate[keep_remove_inport=='keep']),
last_keep=last(westcoastdate[keep_remove_inport=='keep'])) %>%
# indicate the dates of the last pre-trip in-port record and the first post-trip in-port record
mutate(last_pretrip=last(westcoastdate[westcoastdate<first_keep]),
first_posttrip=first(westcoastdate[westcoastdate>last_keep])) %>%
# add a filtering column indicating whether a ping is part of the trip
ungroup() %>%
mutate(keep_remove_not_intrip=ifelse(westcoastdate >= last_pretrip & westcoastdate <= first_posttrip,'keep','remove')) %>%
# change NAs to default behavior of keep
mutate(keep_remove_not_intrip=replace_na(keep_remove_not_intrip,'keep'))
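On a toy trip (hypothetical timestamps, not pipeline data), the logic above keeps the last pre-trip in-port ping and the first post-trip in-port ping:

```r
library(dplyr)
library(tidyr)
# Hypothetical trip: integer "dates" stand in for timestamps
toy <- tibble(
  Rec_ID = 1,
  westcoastdate = 1:7,
  keep_remove_inport = c("remove", "remove", "keep", "keep", "keep",
                         "remove", "remove")
)
toy %>%
  group_by(Rec_ID) %>%
  arrange(westcoastdate) %>%
  mutate(first_keep = first(westcoastdate[keep_remove_inport == "keep"]),
         last_keep  = last(westcoastdate[keep_remove_inport == "keep"]),
         last_pretrip   = last(westcoastdate[westcoastdate < first_keep]),
         first_posttrip = first(westcoastdate[westcoastdate > last_keep])) %>%
  ungroup() %>%
  mutate(keep_remove_not_intrip = ifelse(
    westcoastdate >= last_pretrip & westcoastdate <= first_posttrip,
    "keep", "remove")) %>%
  mutate(keep_remove_not_intrip = replace_na(keep_remove_not_intrip, "keep"))
# Pings 2-6 are kept (last pre-trip ping, the at-sea pings, first post-trip
# ping); pings 1 and 7 are flagged for removal
```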
For a random set of 5 trips, view the results of these filters.
set.seed(0401)
random_trips <- vms_wfilters %>% filter(TARGET_rev %in% spp_codes) %>% distinct(Rec_ID) %>% sample_n(5) %>% pull(Rec_ID)
testtrips <- vms_wfilters %>%
filter(Rec_ID %in% random_trips) %>%
select(Rec_ID,VMS_RECNO,X_COORD,Y_COORD,contains('keep_remove')) %>%
select(-keep_remove_inport) %>%
#pivot to indicate, if a point was removed, which step it was removed in
pivot_longer(contains('keep_remove'),names_to = 'category',values_to = 'keep_remove') %>%
group_by(VMS_RECNO) %>%
# pull out the FIRST reason why a point is flagged for removal (port distance, speed, bathymetry, or in-port)
mutate('reason_removed'=first(category[keep_remove=='remove'],default='kept')) %>%
ungroup() %>%
mutate(reason_removed=str_replace(reason_removed,'keep_remove_','')) %>%
mutate(reason_removed=as.factor(reason_removed)) %>%
select(Rec_ID,VMS_RECNO,X_COORD,Y_COORD,reason_removed) %>%
distinct() %>%
#convert to spatial
st_as_sf(coords=c("X_COORD","Y_COORD"),crs = "+proj=utm +north +zone=10 +ellps=WGS84") %>%
# jitter a tiny bit
st_jitter()
#plot
testtrip_plots <- purrr::map(random_trips, function(x){
testtrip <- testtrips %>% filter(Rec_ID==x)
bbox <- st_bbox(testtrip)
plotout <- ggplot()+
geom_sf(data=coaststates,fill='gray50')+
geom_sf(data=ports_buffer,fill='purple',alpha=0.2)+
geom_sf(data=testtrip,aes(col=reason_removed),key_glyph='point',alpha=1)+
xlim(bbox[1],bbox[3])+ylim(bbox[2],bbox[4])+
labs(x="",y="",color="Reason\nRemoved")+
scale_color_discrete(drop=FALSE)+
theme(axis.text.x = element_text(angle=90))
print(plotout)
})
We now have a list of all VMS points with indicators of whether they are/will be filtered out because of the four reasons described above (port distance of last ping, ludicrous speeds, positive bathymetry, and/or between-trip pings within ports).
We can organize these reasons and look at the overall effects of filtering.
# on the individual ping level, measure reasons for removal
vms_dataloss <- vms_wfilters %>%
select(Rec_ID,VMS_RECNO,X_COORD,Y_COORD,contains('keep_remove')) %>%
select(-keep_remove_inport) %>%
distinct() %>%
#pivot to indicate, if a point was removed, which step it was removed in
pivot_longer(contains('keep_remove'),names_to = 'category',values_to = 'keep_remove') %>%
group_by(VMS_RECNO) %>%
# pull out the FIRST reason why a point is flagged for removal (port distance, speed, bathymetry, or in-port)
mutate('reason_removed'=first(category[keep_remove=='remove'],default='kept'),
num_removals=sum(keep_remove=='remove',na.rm=T)) %>%
ungroup() %>%
mutate(reason_removed=str_replace(reason_removed,'keep_remove_','')) %>%
select(Rec_ID,VMS_RECNO,X_COORD,Y_COORD,reason_removed,num_removals) %>%
distinct()
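The “first reason removed” logic relies on pivot_longer() preserving column order. A toy ping (hypothetical flags, not pipeline data) flagged by both the speed and bathymetry filters:

```r
library(dplyr)
library(tidyr)
# One hypothetical ping flagged by two filters
flags <- tibble(
  VMS_RECNO = 1,
  keep_remove_portdist   = "keep",
  keep_remove_speed      = "remove",
  keep_remove_bathy      = "remove",
  keep_remove_not_intrip = "keep"
)
flags %>%
  pivot_longer(contains("keep_remove"),
               names_to = "category", values_to = "keep_remove") %>%
  group_by(VMS_RECNO) %>%
  summarise(reason_removed = first(category[keep_remove == "remove"],
                                   default = "kept"),
            num_removals = sum(keep_remove == "remove"))
# reason_removed is "keep_remove_speed": the columns pivot in order, so the
# first flagged filter (here, speed) wins
```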
vms_dataloss_table <- vms_dataloss %>%
mutate(total_records=n()) %>%
group_by(reason_removed,total_records) %>%
summarise(n_removed=n()) %>%
ungroup() %>%
filter(reason_removed!='kept') %>%
mutate(stepnum=case_when(
reason_removed=="portdist" ~ 1,
reason_removed=="speed" ~2,
reason_removed=="bathy"~3,
reason_removed=="not_intrip"~4
)) %>%
arrange(stepnum) %>%
mutate(tot_removed=cumsum(n_removed)) %>%
mutate(n_left=total_records-tot_removed) %>%
mutate(prop_remain=1-tot_removed/total_records) %>%
add_row(reason_removed='start',stepnum=0,prop_remain=1)
vms_dataloss_table %>%
ggplot(aes(stepnum,prop_remain))+
geom_point()+
geom_line()+
scale_y_continuous(limits=c(0,1))+
labs(x="Filtering Step",y="Proportion Remaining Observations")+
theme(panel.grid.major = element_line())
vms_dataloss_table %>%
ggplot(aes(reorder(reason_removed,stepnum),prop_remain))+
geom_col(width=0.5)+
scale_y_continuous(limits=c(0,1))+
labs(x="Filtering Step",y="Proportion Remaining Observations")+
theme(panel.grid.major = element_line())
Have we removed > 99% of records from any trips?
check_removes <- vms_dataloss %>%
group_by(Rec_ID) %>%
mutate(remove=num_removals>0) %>%
summarise(prop_removed=sum(remove)/n()) %>%
ungroup()
1114 trips were removed entirely, and 1124 had more than 99 percent of their records removed, out of 19641 total trips.
Finally, we do the actual filtering. We save a filtered version of the data for the next steps, but retain the full VMS-matched dataset with the filtering tags we’ve produced here, for later QA/QC.
vms_filtered <- vms_dataloss %>%
select(VMS_RECNO,num_removals) %>%
right_join(vms_wfilters,by=join_by(VMS_RECNO)) %>%
mutate(remove=num_removals>0) %>%
#final filter
filter(!remove) %>%
distinct(Rec_ID,UTCDATETIME,.keep_all = T) %>%
# clean up, removing filtering columns
select(VMS_RECNO:multispecies,segment_dur,avg_speed_recalc,in_port) %>%
select(-num_removals)
Save the output and the full, non-filtered data set
#Save version with FTID and length
write_rds(vms_filtered,here::here('Confidential','processed','matched','filtering',paste0(process_year,'matched_filtered_withFTID_length.rds')))
write_rds(vms_wfilters,here::here('Confidential','processed','matched','filtering',paste0(process_year,'matched_unfiltered.rds')))
# distribution of segment duration? in minutes
vms_filtered %>% ggplot(aes(segment_dur))+geom_density()+xlim(0,100)
x<-proc.time()-alltime
So far, this pipeline for 2017 VMS data has taken 13.52 minutes to run.